Windows Azure: Building a Secure Backup System (part 4)

10/22/2010 9:19:50 AM

4.4. Compressing Backup Data

Now that the key generation part is out of the way, let’s write the tool that generates the backup itself. You’ll be writing the code for this in a file called azbackup.py.

Users will pass in directories to back up to this little tool. You have several ways of dealing with this input. One valid technique is to encrypt every file separately. However, this quickly becomes a hassle, especially if you have thousands of files to deal with. Thankfully, the Unix world has had experience in doing this sort of thing for a few decades.

Note: One of the earliest references to tar is from Seventh Edition Unix in 1979. This is a descendant of the tap program that shipped with First Edition Unix in 1971.

Backups are typically done in a two-step process. The first step is to gather all the input files into a single file, typically with the extension .tar. The actual file format is very straightforward: the files are concatenated together with a short header before each one.

Managing a single file makes your life easier because it enables you to manipulate this with a slew of standard tools. Authoring scripts becomes a lot easier. Compressing the file gives much better results (because the compression algorithm has a much better chance of finding patterns and redundancies). The canonical way to create a tar file out of a directory and then compress the output using the gzip algorithm is with the following command:

tar -cvf - inputdirectory | gzip > output.tar.gz

Or you could use the shortcut version:

tar -cvzf output.tar.gz inputdirectory

To get back the original contents, the same process is done in reverse. The tarred, gzipped file is decompressed, and then split apart into its constituent files.

Example 5 shows the code to do all of this inside azbackup.py. There are two symmetric functions: generate_tar_gzip and extract_tar_gzip. The former takes a directory or a file to compress, and writes out a tarred, gzipped archive to a specified output filename. The latter does the reverse: it takes an input archive and extracts its contents to a specified directory. The code relies on the tarfile module that ships with Python, which supports all of this.

Example 5. Compressing and extracting archives

import os
import tarfile

def generate_tar_gzip(directory_or_file, output_file_name):
    if directory_or_file.endswith("/"):
        directory_or_file = directory_or_file.rstrip("/")
    # We open a handle to an output tarfile. The 'w:gz'
    # specifies that we're writing to it and that gzip
    # compression should be used
    out = tarfile.TarFile.open(output_file_name, "w:gz")

    # Add the input directory to the tarfile. Arcname is the
    # name of the directory 'inside' the archive. We'll just reuse the name
    # of the file/directory here
    out.add(directory_or_file, arcname = os.path.basename(directory_or_file))
    out.close()

def extract_tar_gzip(archive_file_name, output_directory):
    # Open the tar file and extract all contents to the
    # output directory
    extract = tarfile.TarFile.open(archive_file_name)
    extract.extractall(output_directory)
    extract.close()
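
As a quick check of the two functions, here is a minimal round-trip sketch. It restates them compactly (using `with` blocks and `tarfile.open`) so that it runs on its own under Python 3. The `round_trip_demo` helper, the `photos` directory, and the file names are made up for illustration and are not part of azbackup.py.

```python
import os
import tarfile
import tempfile


def generate_tar_gzip(directory_or_file, output_file_name):
    # Strip a trailing slash so basename() returns the directory name
    directory_or_file = directory_or_file.rstrip("/")
    with tarfile.open(output_file_name, "w:gz") as out:
        out.add(directory_or_file,
                arcname=os.path.basename(directory_or_file))


def extract_tar_gzip(archive_file_name, output_directory):
    with tarfile.open(archive_file_name) as archive:
        archive.extractall(output_directory)


def round_trip_demo():
    """Archive a scratch directory, extract it elsewhere, return the file text."""
    work = tempfile.mkdtemp()
    source = os.path.join(work, "photos")  # hypothetical input directory
    os.mkdir(source)
    with open(os.path.join(source, "note.txt"), "w") as handle:
        handle.write("hello backup")

    archive = os.path.join(work, "photos.tar.gz")
    generate_tar_gzip(source + "/", archive)

    restored = os.path.join(work, "restored")
    os.mkdir(restored)
    extract_tar_gzip(archive, restored)

    # The archive keeps the top-level directory name ("photos")
    with open(os.path.join(restored, "photos", "note.txt")) as handle:
        return handle.read()
```

Because arcname is the base name of the input, the extracted tree reappears under a directory with the same name as the original.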

4.5. Encrypting Data

azbackup will use the following three-step process to encrypt data (with “data” here being the compressed archives generated from the previous step):

For every archive, it’ll generate a unique 256-bit key. Let’s call this key K_sym.
K_sym is used to encrypt the archive using AES-256 in CBC mode. (You’ll learn what “CBC” means in just a bit.)
K_sym is encrypted by the user’s RSA encryption key (K_enc) and attached to the encrypted data from the previous step.

Example 6 shows the code in the crypto module corresponding to the previously described three steps.

Example 6. Encrypting data

import random
import cStringIO
from binascii import unhexlify

from M2Crypto import EVP, RSA

def generate_rand_bits(bits=32*8):
    """Get n bits of randomness. SystemRandom is a cryptographically
    strong source of randomness"""

    sys_random = random.SystemRandom()
    return long_as_bytes(sys_random.getrandbits(bits), bits/8)

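def long_as_bytes(lvalue, width):
    """This rather dense piece of code takes a long and splits it apart into a
    byte string of exactly `width` bytes containing its constituent bytes,
    with the most significant byte first"""

    # Build a zero-padded hex format such as '%.64x' for width == 32
    fmt = '%%.%dx' % (2*width)
    # Mask to `width` bytes, render as hex, then convert the hex to raw bytes
    return unhexlify(fmt % (lvalue & ((1L<<8*width)-1)))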


def block_encrypt(data, key):
    """ High-level function which takes data and key as parameters
        and turns them into IV + ciphertext (after padding).
        Note that a signature still needs to be added at the end"""
    iv = generate_rand_bits(32 * 8)
    ciphertext = aes256_encrypt_data(data, key, iv)

    return iv + ciphertext


def aes256_encrypt_data(data, key, iv):
    """ Takes data, a 256-bit key and an IV and
    encrypts it. Encryption is done
    with AES 256 in CBC mode. Note that OpenSSL is doing
    the padding for us"""
    enc = 1  # 1 selects encryption (0 would select decryption)
    cipher = EVP.Cipher('aes_256_cbc', key, iv, enc, 0)

    pbuf = cStringIO.StringIO(data)
    cbuf = cStringIO.StringIO()

    ciphertext = aes256_stream_helper(cipher, pbuf, cbuf)
    pbuf.close()
    cbuf.close()
    return ciphertext



def aes256_stream_helper(cipher, input_stream, output_stream):

    while True:
        buf = input_stream.read()
        if not buf:
            break
        output_stream.write(cipher.update(buf))
    output_stream.write(cipher.final())
    return output_stream.getvalue()

def encrypt_rsa(rsa_key, data):
    return rsa_key.public_encrypt(data, RSA.pkcs1_padding)

That was quite a bit of code, so let’s break down what this code does.

4.5.1. Generating a unique K_sym

The work of generating a random, unique key is done by generate_rand_bits. This takes the number of bits to generate as a parameter. In this case, it’ll be called with 256 because you are using AES-256. You call through to Python’s random.SystemRandom to get a cryptographically strong random number.

Note: It is important to use this rather than the default generator in Python’s random module (a Mersenne Twister). Cryptographically strong random-number generators have a number of important security properties that make their output difficult to predict. Using the default generator instead creates an immediate security vulnerability, because an attacker who can predict the key can decrypt the data. As you can imagine, this is a common mistake, made even by reputable software vendors.

Where does this cryptographically strong random-number generator come from? In this case, Python lets the operating system do the heavy lifting. On Unix this reads from /dev/urandom, while on Windows it calls CryptGenRandom. These are both valid (and, in fact, the recommended) means of getting good random numbers.
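
The same idea can be sketched in a few lines of standard-library Python. This is an illustration rather than the azbackup code: it assumes Python 3 (for `int.to_bytes`), and the two function names are made up for this example. Both functions draw on the operating system's generator.

```python
import os
import random


def generate_key_bytes(num_bytes=32):
    """Return num_bytes of OS-provided randomness (32 bytes = 256 bits)."""
    return os.urandom(num_bytes)


def generate_key_bytes_via_system_random(num_bytes=32):
    """Same idea through random.SystemRandom, mirroring generate_rand_bits."""
    value = random.SystemRandom().getrandbits(num_bytes * 8)
    # Big-endian and zero-padded to a fixed width, so a value that happens
    # to start with zero bits still yields exactly num_bytes bytes.
    return value.to_bytes(num_bytes, "big")
```

The fixed-width conversion matters: without padding, a random value with leading zero bits would produce a key shorter than 32 bytes.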

4.5.2. Encrypting using AES-256

After generating a unique K_sym, the next step is to encrypt data using AES-256. The “256” here refers to the key size, not the block size. AES is a block cipher: it takes a fixed-size block of plaintext (always 128 bits for AES) and a key (256 bits, in this case), and converts the block into ciphertext of the same length. The obvious problem here is that the data is somewhat longer than 128 bits.

Not surprisingly, there are several mechanisms to deal with this, and they are called modes of operation. In this particular case, the chosen mode is cipher block chaining (CBC). Figure 4 shows how this mode works. The incoming data (plaintext) is split into block-size chunks. Each block of plaintext is XORed with the previous ciphertext block before being encrypted.

Figure 4. CBC encryption
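
To make the chaining concrete, here is a minimal sketch of CBC in Python 3. It deliberately uses a toy stand-in for AES (XOR with the key, then reverse the bytes) so that it runs with the standard library alone. The toy cipher is not secure, and all of the function names are assumptions made for this illustration; only the chaining structure corresponds to what OpenSSL does for the sample code.

```python
BLOCK_SIZE = 16  # bytes; AES also works on 128-bit blocks


def xor_bytes(left, right):
    return bytes(a ^ b for a, b in zip(left, right))


def toy_encrypt_block(block, key):
    """Stand-in for the AES block operation. NOT secure."""
    return xor_bytes(block, key)[::-1]


def toy_decrypt_block(block, key):
    return xor_bytes(block[::-1], key)


def cbc_encrypt(plaintext, key, iv):
    """Plaintext length must be a multiple of BLOCK_SIZE (no padding here)."""
    previous = iv  # the IV plays the role of "ciphertext block zero"
    out = []
    for start in range(0, len(plaintext), BLOCK_SIZE):
        block = plaintext[start:start + BLOCK_SIZE]
        # XOR with the previous ciphertext block, then encrypt
        previous = toy_encrypt_block(xor_bytes(block, previous), key)
        out.append(previous)
    return b"".join(out)


def cbc_decrypt(ciphertext, key, iv):
    previous = iv
    out = []
    for start in range(0, len(ciphertext), BLOCK_SIZE):
        block = ciphertext[start:start + BLOCK_SIZE]
        # Decrypt, then undo the XOR with the previous ciphertext block
        out.append(xor_bytes(toy_decrypt_block(block, key), previous))
        previous = block
    return b"".join(out)
```

Encrypting two identical plaintext blocks with this sketch produces two different ciphertext blocks, because the second block is mixed with the ciphertext of the first before it is encrypted.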

Why not just encrypt every block separately and concatenate all the ciphertext?

This is what the Electronic Codebook (ECB) mode does, and it is very insecure. Typically, input data will have lots of well-known structures (file format headers, whitespace, and so on). Since identical plaintext blocks always encrypt to identical ciphertext blocks under the same key, the attacker can look for repeating forms (the encrypted versions of the aforementioned structure) and glean information about the data. CBC (the technique used here) prevents this attack because the encrypted form of every block also depends on the blocks that come before it.
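
The ECB weakness is easy to see with a small Python 3 sketch. As an assumption for illustration, it uses a toy block function (XOR with the key, then reverse) in place of AES; the leak shown here depends only on each block being encrypted independently, not on which cipher is used. The function names are hypothetical.

```python
from collections import Counter

BLOCK_SIZE = 16  # bytes


def toy_encrypt_block(block, key):
    """Stand-in for the AES block operation. NOT secure."""
    return bytes(a ^ b for a, b in zip(block, key))[::-1]


def ecb_encrypt(plaintext, key):
    """Encrypt every block on its own and return the list of ciphertext blocks."""
    blocks = [plaintext[i:i + BLOCK_SIZE]
              for i in range(0, len(plaintext), BLOCK_SIZE)]
    return [toy_encrypt_block(block, key) for block in blocks]


def count_repeated_blocks(cipher_blocks):
    """Number of ciphertext blocks that duplicate an earlier block."""
    counts = Counter(cipher_blocks)
    return sum(n - 1 for n in counts.values())
```

For example, encrypting three blocks of spaces followed by one distinct block yields two repeated ciphertext blocks, so an observer learns that the input began with repeated content without ever seeing the key.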

This still leaves some vulnerability. Since the first few blocks of data can be the same, the attacker can spot patterns in the beginning of the data stream. To avoid this, the block cipher takes an initialization vector (IV). This is a block filled with random data that the cipher will use as the “starting block.” This ensures that any pattern in the beginning input data is undetectable.

This data doesn’t need to be secret and, in fact, is usually added to the encrypted data in some form so that the receiver knows what IV was used. However, it does need to be different for each archive, and it can never be reused with the same key. In this sample code, you generate IVs the same way you generate random keys: by making a call to generate_rand_bits.

Note: Reusing the same IV is typically a “no-no.” Bad usage of IVs is the core reason Wired Equivalent Privacy (WEP) is considered insecure.

The core of the encryption work is done in aes256_encrypt_data. This takes the input plaintext, K_sym, and a unique IV. It creates an instance of the EVP.Cipher class and specifies that it wants to use AES-256 in CBC mode. The EVP.Cipher class is best used in a streaming mode. The little helper method aes256_stream_helper does exactly this. It takes the cipher, the input data stream, and an output stream as parameters. It writes data into the cipher object, and reads the ciphertext into the output stream.

Note: Again, these techniques can be used on any major programming platform. In .NET, AES is supported through the System.Security.Cryptography.Rijndael class.

All this is wrapped up by block_encrypt, which makes the actual call to generate the IV, encrypts the incoming data, and then returns the IV followed by the encrypted data as a single byte string.
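
Because the IV is simply prepended, a receiver can recover it by slicing. The sketch below (Python 3) shows one way this might look. The function names are hypothetical, and the 32-byte length merely mirrors the generate_rand_bits(32 * 8) call in block_encrypt, even though an AES block is 16 bytes.

```python
IV_LENGTH = 32  # bytes; mirrors generate_rand_bits(32 * 8) in block_encrypt


def join_iv_and_ciphertext(iv, ciphertext):
    """Build the same IV + ciphertext layout that block_encrypt returns."""
    if len(iv) != IV_LENGTH:
        raise ValueError("unexpected IV length")
    return iv + ciphertext


def split_iv_and_ciphertext(blob):
    """Undo join_iv_and_ciphertext: return (iv, ciphertext)."""
    if len(blob) < IV_LENGTH:
        raise ValueError("blob is too short to contain an IV")
    return blob[:IV_LENGTH], blob[IV_LENGTH:]
```

A fixed, known IV length is what makes this layout unambiguous: no separator or length field is needed.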

4.5.3. Encrypting K_sym using K_enc

The final step is to encrypt K_sym using K_enc. Since this is an RSA key pair, this encryption is done with the public key portion. RSA is sensitive to the size and structure of the input data, so the encryption is done using a well-known padding scheme supported by OpenSSL.
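
The size sensitivity can be made concrete with a little arithmetic. PKCS #1 v1.5 encryption padding (which is what RSA.pkcs1_padding selects) needs at least 11 bytes of overhead, so the largest message that fits is the modulus length in bytes minus 11. The Python 3 sketch below works this out; the helper names and the 2048-bit modulus used in the usage note are assumptions for illustration, since the key size is not restated in this section.

```python
PKCS1_V15_OVERHEAD = 11  # bytes: 0x00, 0x02, at least 8 padding bytes, 0x00


def max_pkcs1_v15_payload(modulus_bits):
    """Largest plaintext, in bytes, that one RSA operation can carry."""
    modulus_bytes = (modulus_bits + 7) // 8
    return modulus_bytes - PKCS1_V15_OVERHEAD


def key_fits(modulus_bits, key_bytes=32):
    """True if a symmetric key of key_bytes fits in a single RSA block."""
    return key_bytes <= max_pkcs1_v15_payload(modulus_bits)
```

With a 2048-bit modulus, max_pkcs1_v15_payload returns 245, so the 32-byte K_sym fits comfortably. This limit is also why RSA is used here to encrypt only the key and not the archive itself.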

The actual encryption is done by encrypt_rsa. This takes an RSA key pair as a parameter (which, in this case, is a type in the M2Crypto package) and calls a method on that object to encrypt the input data.

Note: The fact that you’re using only the public key portion to encrypt is significant. Though no support for this was added as of this writing, a different key format can be implemented that separates the public key from the private key, and puts them in different files. The encryption code must have access only to the public key, and thus can run from insecure machines.

At the end of this process, you now have encrypted data and an encrypted key in a large byte array that can then be uploaded to the cloud.